This is mostly important if you want to read local files (the computer will in your current directory to find stuff) If the computer cannot find the files there, it will of course throw you an error message
getwd()
## [1] "/Users/finnlo/Documents/Github/Resources/R_Introduction"
setwd("/Users/finnlo/Documents/Github/Resources/R_Introduction/") # my personal path on my computer
Follow instructions on this site to connect R with github.
This may take some time and nerves, but it’s worth it (in my opinion)!
Connecting allows you to upload files to github directly in R.
Every package can be installed using install.packages(‘xxx’).
Every package can be loaded by using library(xxx).
Here we will check if you have all the packages installed for the tutorial:
library(tidyverse) #install.packages('tidyverse')
library(data.table) #install.packages('data.table') # for the setnames() function
library(tibble) #install.packages('tibble') # to use tibbles
library(readr) #install.packages('readr') # to read tsv files
library(xlsx) #install.packages('readxl') # to read xlsx files
library(knitr) #install.packages('knitr') # for making tables in markdown
library(visdat) #install.packages('visdat') # for looking at missing data
library(scales) #install.packages('scales') # for the percent() functionIn R you will encounter data of different classes:
Logical (e.g. TRUE/T or FALSE/F)
Numeric (num, e.g. 1, 2.0, 3, 44.5, 100)
Character (chr, e.g. ‘Hello’, ‘a’, ‘13.4’, ‘True’)
class(165)
## [1] "numeric"
class('Hello')
## [1] "character"
class(T)
## [1] "logical"
class(FALSE)
## [1] "logical"Vector = single row of data, same class
Vector <- c(1:100)
Vector2 <- Vector*2
Vector + Vector2
## [1] 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54
## [19] 57 60 63 66 69 72 75 78 81 84 87 90 93 96 99 102 105 108
## [37] 111 114 117 120 123 126 129 132 135 138 141 144 147 150 153 156 159 162
## [55] 165 168 171 174 177 180 183 186 189 192 195 198 201 204 207 210 213 216
## [73] 219 222 225 228 231 234 237 240 243 246 249 252 255 258 261 264 267 270
## [91] 273 276 279 282 285 288 291 294 297 300
# you can subset your vectors as well
Days_Vector <- c("Mon", "Tue", "Wed", "Thurs", "Fri")
Days_Vector[1] # just Monday
## [1] "Mon"
Days_Vector[2:5] # Tue - Friday
## [1] "Tue" "Wed" "Thurs" "Fri"
Days_Vector[-5] # all but Friday
## [1] "Mon" "Tue" "Wed" "Thurs"
Days_Vector[10] # not available
## [1] NALists = single row of data, but DIFFERENT data types possible
List <- c(1, 'apple', T, 'BANANA')
List
## [1] "1" "apple" "TRUE" "BANANA"matrix = 2D version of a vector
data frame = many vectors pasted together as columns
rm(Days_Vector, List, Vector, Vector2) # remove one or more things (lists, vectors, data frames) from your environment
There are tons of functions in Base R (meaning that they are not in extra packages you need to load) Here are just a few that might come in handy!
Don’t forget, if you ever need help with a function and how to use it: type ?function and the help page will pop up
Values <- 1:100
?mean
mean(Values)
median(Values)
min(Values)
max(Values)
sum(Values)
sd(Values)
class(Values)
length(Values)
log(Values) # natural log by default
log10(Values)
mysqrt <- sqrt(Values)
Read the same table with different file types and make sure you get the same result for all of them: (preferably you make your life easier and have csv files)
To make sure that the table we add is correctly addressing the NA-situation, we add the na.strings = c(…) or equivalents to each of the read functions (beware, there is no such thing for reading xlsx, so you need to take care of this later)
library(readr)
library(xlsx)
dirty_csv <- read.csv('https://raw.githubusercontent.com/tlobnow/Resources/main/R_Introduction/dirty_files/Dirty_Data.csv', skip = 2, na.strings = c('', ' ', 'NA'))
dirty_tsv <- read_tsv('https://raw.githubusercontent.com/tlobnow/Resources/main/R_Introduction/dirty_files/Dirty_Data.tsv', skip = 2, na = c('', ' ', 'NA'))
# I will load the .xlsx file locally, so be aware of your workding directory:
dirty_xlsx <- read.xlsx('dirty_files/Dirty_Data.xlsx', sheetIndex = 1, startRow = 3)
rm(dirty_tsv, dirty_xlsx)
# I just want you to see that you CAN open different file types,
# but we will remove the other files and work with the data frame we made from the csv file
We want all column names to have an uppercase first letter, rest can be lower case (Xxxx).
You can do this by either correcting all in one or change them individually.
For the ‘all in one’ version you have to load the ‘data.table’ package and follow this pattern:
setnames(df, old = c("old1", "old2", "old3"), new = c("new1", "new2", "new3"), skip_absent = T)
Alternatively, you can change the colnames individually, for example using this pattern:
colnames(df)[colnames(df)%in%"Colname1_old"] <- "Colname1_new"
library(data.table)
# all at once
setnames(dirty_csv, old = c("Student", "gender", "AGE", "fruit", "sport"), new = c("Student_ID", "Gender", "Age", "Fav_Fruit", "Fav_Sport"), skip_absent = T)
# individually
colnames(dirty_csv)[colnames(dirty_csv) %in% 'Student'] <- 'Student_ID'
colnames(dirty_csv)[colnames(dirty_csv) %in% 'gender'] <- 'Gender'
colnames(dirty_csv)[colnames(dirty_csv) %in% 'AGE'] <- 'Age'
colnames(dirty_csv)[colnames(dirty_csv) %in% 'fruit'] <- 'Fav_Fruit'
colnames(dirty_csv)[colnames(dirty_csv) %in% 'sport'] <- 'Fav_Sport'
kable(dirty_csv, align = 'ccccccccccccccccc')
| Student_ID | Name | Gender | Age | Fav_Fruit | Fav_Sport | Do_you_like_Math | Fav_Animal | Last_Number_ZIP |
|---|---|---|---|---|---|---|---|---|
| Student_001 | Jimmy | male | 10 | apple | football | yes | Bunny | 1 |
| Student_002 | Mathilde | FEMALE | 11 | coconut | soccer | no | NA | 2 |
| Student_003 | laura | NA | 10 | dragon fruit | ballet | yes | Horse | 3 |
| Student_004 | selma | m | 9 | lemon | swimming | yes and no | NA | 4 |
| Student_005 | ELISA | fem | 10 | apple | ballet | no | Horse | 1 |
| Student_006 | Andreas | tennis | 1000 | banana | MALE | no not really | NA | 2 |
| Student_007 | Tom | m | 9 | pear | swimming | no | Tiger | 2 |
| Student_00eight | Lily | f | ten | orange | ballet | yes (don’t tell mom) | NA | 4 |
| Student_009 | max | mal | 11 | cherry | karate | yes | NA | 6 |
| Student_0010 | Priscilla | m | 12 | soccer | strawberry | yes | Dog | 8 |
| NA | Thomas | m | 13 | soccer | melon | no | NA | 8 |
| Student_021 | Liam | m | 10 | apple | soccer | yes sometimes | Cat | 6 |
| Student_022 | Leanna | f | 11 | banana | basketball | no not really | Cat | 7 |
| Student_023 | Jodi | f | 12 | orange | tennis | yes | Dog | 6 |
| Student_024 | Tammi | f | 10 | banana | hockey | no | Lion | 3 |
| Student_025 | Elliot | male | 1100 | cherry | soccer | no | Tiger | 2 |
| Student_026 | Lilian | f | 1200 | pear | table tennis | no | Bobcat | 2 |
| Student_027 | Noah | mal | 10 | apple | tennis | yes | Whale | 1 |
| Student_028 | Jeanie | f | 11 | strawberry | ballet | yes | Unicorn | 6 |
| Student_029 | Louis | m | 12 | orange | swimming | no | Wolf | 4 |
| Student_030 | Therese | FEMALE | 13 | blood orange | karate | yes | Turtle | 5 |
| NA | Elias | NA | 10 | NA | basketball | NA | NA | 1 |
Let’s look at two ways to do this: you can either select the column and change input for several rows at once:
df$colname[df$colname %in% c("old1", "old2")] <- c("new1", "new2")
Or you can change the row input using the gsub function, which is generally better when you also look for a pattern, but it can be an individual row input like in our case (generally the first option is better, but both get the job done!):
df$colname <- gsub(pattern = "old1", replacement = "new1", x = df$colname)
# all in one
dirty_csv$Student_ID[dirty_csv$Student_ID %in% c("Student_00eight", "Student_0010")] <- c("Student_008", "Student_010")
dirty_csv$Student_ID[ dirty_csv$Name %in% "Thomas"] <- 'Student_011'
# individually
dirty_csv$Student_ID <- gsub(pattern = "Student_00eight", replacement = "Student_008", x = dirty_csv$Student_ID)
dirty_csv$Student_ID <- gsub(pattern = "Student_0010", replacement = "Student_010", x = dirty_csv$Student_ID)
dirty_csv$Student_ID[ dirty_csv$Name %in% "Thomas"] <- 'Student_011'
kable(dirty_csv, align = 'ccccccccccccccccc')
| Student_ID | Name | Gender | Age | Fav_Fruit | Fav_Sport | Do_you_like_Math | Fav_Animal | Last_Number_ZIP |
|---|---|---|---|---|---|---|---|---|
| Student_001 | Jimmy | male | 10 | apple | football | yes | Bunny | 1 |
| Student_002 | Mathilde | FEMALE | 11 | coconut | soccer | no | NA | 2 |
| Student_003 | laura | NA | 10 | dragon fruit | ballet | yes | Horse | 3 |
| Student_004 | selma | m | 9 | lemon | swimming | yes and no | NA | 4 |
| Student_005 | ELISA | fem | 10 | apple | ballet | no | Horse | 1 |
| Student_006 | Andreas | tennis | 1000 | banana | MALE | no not really | NA | 2 |
| Student_007 | Tom | m | 9 | pear | swimming | no | Tiger | 2 |
| Student_008 | Lily | f | ten | orange | ballet | yes (don’t tell mom) | NA | 4 |
| Student_009 | max | mal | 11 | cherry | karate | yes | NA | 6 |
| Student_010 | Priscilla | m | 12 | soccer | strawberry | yes | Dog | 8 |
| Student_011 | Thomas | m | 13 | soccer | melon | no | NA | 8 |
| Student_021 | Liam | m | 10 | apple | soccer | yes sometimes | Cat | 6 |
| Student_022 | Leanna | f | 11 | banana | basketball | no not really | Cat | 7 |
| Student_023 | Jodi | f | 12 | orange | tennis | yes | Dog | 6 |
| Student_024 | Tammi | f | 10 | banana | hockey | no | Lion | 3 |
| Student_025 | Elliot | male | 1100 | cherry | soccer | no | Tiger | 2 |
| Student_026 | Lilian | f | 1200 | pear | table tennis | no | Bobcat | 2 |
| Student_027 | Noah | mal | 10 | apple | tennis | yes | Whale | 1 |
| Student_028 | Jeanie | f | 11 | strawberry | ballet | yes | Unicorn | 6 |
| Student_029 | Louis | m | 12 | orange | swimming | no | Wolf | 4 |
| Student_030 | Therese | FEMALE | 13 | blood orange | karate | yes | Turtle | 5 |
| NA | Elias | NA | 10 | NA | basketball | NA | NA | 1 |
We will now talk about piping:
You can use the pipe symbol %>% to start with data, narrow it down, change it, add columns, etc.
For that, you have to specify the data you start with and tell R where you want to save the piped things (assign object).
This object can be the same data frame you start with, but it in that case you will override the data from before.
df <- df %>% bla bla bla
If you want to keep the data frame as it was before, you must assign to a new object (make it descriptive, but short)
df_1 <- df %>% bla bla bla
The way that you can find NAs is by using is.na()
If you want to find all row inputs of a column that are NOT NA, then you can use !is.na()
The ‘!’ is useful in many cases to do the opposite of said call, e.g.:
# exclude Student_IDs that have no assigned name, not useful!
dirty_csv <- dirty_csv %>% filter(!is.na(Student_ID))
# change the student names as we have learned in previous steps:
dirty_csv$Name[dirty_csv$Name %in% c("laura", "selma", "ELISA", "max")] <- c("Laura", "Selma", "Elisa", "Max")
kable(dirty_csv, align = 'ccccccccccccccccc')
| Student_ID | Name | Gender | Age | Fav_Fruit | Fav_Sport | Do_you_like_Math | Fav_Animal | Last_Number_ZIP |
|---|---|---|---|---|---|---|---|---|
| Student_001 | Jimmy | male | 10 | apple | football | yes | Bunny | 1 |
| Student_002 | Mathilde | FEMALE | 11 | coconut | soccer | no | NA | 2 |
| Student_003 | Laura | NA | 10 | dragon fruit | ballet | yes | Horse | 3 |
| Student_004 | Selma | m | 9 | lemon | swimming | yes and no | NA | 4 |
| Student_005 | Elisa | fem | 10 | apple | ballet | no | Horse | 1 |
| Student_006 | Andreas | tennis | 1000 | banana | MALE | no not really | NA | 2 |
| Student_007 | Tom | m | 9 | pear | swimming | no | Tiger | 2 |
| Student_008 | Lily | f | ten | orange | ballet | yes (don’t tell mom) | NA | 4 |
| Student_009 | Max | mal | 11 | cherry | karate | yes | NA | 6 |
| Student_010 | Priscilla | m | 12 | soccer | strawberry | yes | Dog | 8 |
| Student_011 | Thomas | m | 13 | soccer | melon | no | NA | 8 |
| Student_021 | Liam | m | 10 | apple | soccer | yes sometimes | Cat | 6 |
| Student_022 | Leanna | f | 11 | banana | basketball | no not really | Cat | 7 |
| Student_023 | Jodi | f | 12 | orange | tennis | yes | Dog | 6 |
| Student_024 | Tammi | f | 10 | banana | hockey | no | Lion | 3 |
| Student_025 | Elliot | male | 1100 | cherry | soccer | no | Tiger | 2 |
| Student_026 | Lilian | f | 1200 | pear | table tennis | no | Bobcat | 2 |
| Student_027 | Noah | mal | 10 | apple | tennis | yes | Whale | 1 |
| Student_028 | Jeanie | f | 11 | strawberry | ballet | yes | Unicorn | 6 |
| Student_029 | Louis | m | 12 | orange | swimming | no | Wolf | 4 |
| Student_030 | Therese | FEMALE | 13 | blood orange | karate | yes | Turtle | 5 |
To change the gender accordingly, we will use the case_when() function:
This is a function that allows you to adjust/make columns according to different conditions.
df <- df %>% mutate(colname = casewhen(colname == 'old_pattern1' ~ 'new_pattern1',
colname == 'old_pattern2' ~ 'new_pattern2'))
dirty_csv <- dirty_csv %>% mutate(Gender = case_when(Gender == 'f' ~ 'f',
Gender == 'FEMALE' ~ 'f',
Gender == 'fem' ~ 'f',
is.na(Gender) ~ 'f',
Gender == 'm' ~ 'm',
Gender == 'mal' ~ 'm',
Gender == 'male' ~ 'm',
Gender == 'tennis' ~ 'm'))
# don't forget that the last input (tennis) was mistakenly in this column and should be added to Fav_Sport later!
dirty_csv$Gender[ dirty_csv$Name %in% 'Priscilla'] <- 'f'
kable(dirty_csv, align = 'ccccccccccccccccc')
| Student_ID | Name | Gender | Age | Fav_Fruit | Fav_Sport | Do_you_like_Math | Fav_Animal | Last_Number_ZIP |
|---|---|---|---|---|---|---|---|---|
| Student_001 | Jimmy | m | 10 | apple | football | yes | Bunny | 1 |
| Student_002 | Mathilde | f | 11 | coconut | soccer | no | NA | 2 |
| Student_003 | Laura | f | 10 | dragon fruit | ballet | yes | Horse | 3 |
| Student_004 | Selma | m | 9 | lemon | swimming | yes and no | NA | 4 |
| Student_005 | Elisa | f | 10 | apple | ballet | no | Horse | 1 |
| Student_006 | Andreas | m | 1000 | banana | MALE | no not really | NA | 2 |
| Student_007 | Tom | m | 9 | pear | swimming | no | Tiger | 2 |
| Student_008 | Lily | f | ten | orange | ballet | yes (don’t tell mom) | NA | 4 |
| Student_009 | Max | m | 11 | cherry | karate | yes | NA | 6 |
| Student_010 | Priscilla | f | 12 | soccer | strawberry | yes | Dog | 8 |
| Student_011 | Thomas | m | 13 | soccer | melon | no | NA | 8 |
| Student_021 | Liam | m | 10 | apple | soccer | yes sometimes | Cat | 6 |
| Student_022 | Leanna | f | 11 | banana | basketball | no not really | Cat | 7 |
| Student_023 | Jodi | f | 12 | orange | tennis | yes | Dog | 6 |
| Student_024 | Tammi | f | 10 | banana | hockey | no | Lion | 3 |
| Student_025 | Elliot | m | 1100 | cherry | soccer | no | Tiger | 2 |
| Student_026 | Lilian | f | 1200 | pear | table tennis | no | Bobcat | 2 |
| Student_027 | Noah | m | 10 | apple | tennis | yes | Whale | 1 |
| Student_028 | Jeanie | f | 11 | strawberry | ballet | yes | Unicorn | 6 |
| Student_029 | Louis | m | 12 | orange | swimming | no | Wolf | 4 |
| Student_030 | Therese | f | 13 | blood orange | karate | yes | Turtle | 5 |
dirty_csv$Age[dirty_csv$Age %in% 'ten'] <- 10
dirty_csv$Age <- as.numeric(dirty_csv$Age)
dirty_csv <- dirty_csv %>% mutate(Age = ifelse(Age > 20, # logical expression
Age / 100, # what happens if the logical condition is true
Age)) # what happens if the logical condition is false
kable(dirty_csv, align = 'ccccccccccccccccc')
| Student_ID | Name | Gender | Age | Fav_Fruit | Fav_Sport | Do_you_like_Math | Fav_Animal | Last_Number_ZIP |
|---|---|---|---|---|---|---|---|---|
| Student_001 | Jimmy | m | 10 | apple | football | yes | Bunny | 1 |
| Student_002 | Mathilde | f | 11 | coconut | soccer | no | NA | 2 |
| Student_003 | Laura | f | 10 | dragon fruit | ballet | yes | Horse | 3 |
| Student_004 | Selma | m | 9 | lemon | swimming | yes and no | NA | 4 |
| Student_005 | Elisa | f | 10 | apple | ballet | no | Horse | 1 |
| Student_006 | Andreas | m | 10 | banana | MALE | no not really | NA | 2 |
| Student_007 | Tom | m | 9 | pear | swimming | no | Tiger | 2 |
| Student_008 | Lily | f | 10 | orange | ballet | yes (don’t tell mom) | NA | 4 |
| Student_009 | Max | m | 11 | cherry | karate | yes | NA | 6 |
| Student_010 | Priscilla | f | 12 | soccer | strawberry | yes | Dog | 8 |
| Student_011 | Thomas | m | 13 | soccer | melon | no | NA | 8 |
| Student_021 | Liam | m | 10 | apple | soccer | yes sometimes | Cat | 6 |
| Student_022 | Leanna | f | 11 | banana | basketball | no not really | Cat | 7 |
| Student_023 | Jodi | f | 12 | orange | tennis | yes | Dog | 6 |
| Student_024 | Tammi | f | 10 | banana | hockey | no | Lion | 3 |
| Student_025 | Elliot | m | 11 | cherry | soccer | no | Tiger | 2 |
| Student_026 | Lilian | f | 12 | pear | table tennis | no | Bobcat | 2 |
| Student_027 | Noah | m | 10 | apple | tennis | yes | Whale | 1 |
| Student_028 | Jeanie | f | 11 | strawberry | ballet | yes | Unicorn | 6 |
| Student_029 | Louis | m | 12 | orange | swimming | no | Wolf | 4 |
| Student_030 | Therese | f | 13 | blood orange | karate | yes | Turtle | 5 |
dirty_csv$Fav_Fruit[dirty_csv$Name %in% "Thomas"] <- 'melon'
dirty_csv$Fav_Fruit[dirty_csv$Name %in% "Priscilla"] <- 'strawberry'
kable(dirty_csv, align = 'ccccccccccccccccc')
| Student_ID | Name | Gender | Age | Fav_Fruit | Fav_Sport | Do_you_like_Math | Fav_Animal | Last_Number_ZIP |
|---|---|---|---|---|---|---|---|---|
| Student_001 | Jimmy | m | 10 | apple | football | yes | Bunny | 1 |
| Student_002 | Mathilde | f | 11 | coconut | soccer | no | NA | 2 |
| Student_003 | Laura | f | 10 | dragon fruit | ballet | yes | Horse | 3 |
| Student_004 | Selma | m | 9 | lemon | swimming | yes and no | NA | 4 |
| Student_005 | Elisa | f | 10 | apple | ballet | no | Horse | 1 |
| Student_006 | Andreas | m | 10 | banana | MALE | no not really | NA | 2 |
| Student_007 | Tom | m | 9 | pear | swimming | no | Tiger | 2 |
| Student_008 | Lily | f | 10 | orange | ballet | yes (don’t tell mom) | NA | 4 |
| Student_009 | Max | m | 11 | cherry | karate | yes | NA | 6 |
| Student_010 | Priscilla | f | 12 | strawberry | strawberry | yes | Dog | 8 |
| Student_011 | Thomas | m | 13 | melon | melon | no | NA | 8 |
| Student_021 | Liam | m | 10 | apple | soccer | yes sometimes | Cat | 6 |
| Student_022 | Leanna | f | 11 | banana | basketball | no not really | Cat | 7 |
| Student_023 | Jodi | f | 12 | orange | tennis | yes | Dog | 6 |
| Student_024 | Tammi | f | 10 | banana | hockey | no | Lion | 3 |
| Student_025 | Elliot | m | 11 | cherry | soccer | no | Tiger | 2 |
| Student_026 | Lilian | f | 12 | pear | table tennis | no | Bobcat | 2 |
| Student_027 | Noah | m | 10 | apple | tennis | yes | Whale | 1 |
| Student_028 | Jeanie | f | 11 | strawberry | ballet | yes | Unicorn | 6 |
| Student_029 | Louis | m | 12 | orange | swimming | no | Wolf | 4 |
| Student_030 | Therese | f | 13 | blood orange | karate | yes | Turtle | 5 |
dirty_csv$Fav_Sport[dirty_csv$Fav_Sport %in% c("MALE", "strawberry", 'melon')] <- c("tennis", "soccer", "soccer")
kable(dirty_csv, align = 'ccccccccccccccccc')
| Student_ID | Name | Gender | Age | Fav_Fruit | Fav_Sport | Do_you_like_Math | Fav_Animal | Last_Number_ZIP |
|---|---|---|---|---|---|---|---|---|
| Student_001 | Jimmy | m | 10 | apple | football | yes | Bunny | 1 |
| Student_002 | Mathilde | f | 11 | coconut | soccer | no | NA | 2 |
| Student_003 | Laura | f | 10 | dragon fruit | ballet | yes | Horse | 3 |
| Student_004 | Selma | m | 9 | lemon | swimming | yes and no | NA | 4 |
| Student_005 | Elisa | f | 10 | apple | ballet | no | Horse | 1 |
| Student_006 | Andreas | m | 10 | banana | tennis | no not really | NA | 2 |
| Student_007 | Tom | m | 9 | pear | swimming | no | Tiger | 2 |
| Student_008 | Lily | f | 10 | orange | ballet | yes (don’t tell mom) | NA | 4 |
| Student_009 | Max | m | 11 | cherry | karate | yes | NA | 6 |
| Student_010 | Priscilla | f | 12 | strawberry | soccer | yes | Dog | 8 |
| Student_011 | Thomas | m | 13 | melon | soccer | no | NA | 8 |
| Student_021 | Liam | m | 10 | apple | soccer | yes sometimes | Cat | 6 |
| Student_022 | Leanna | f | 11 | banana | basketball | no not really | Cat | 7 |
| Student_023 | Jodi | f | 12 | orange | tennis | yes | Dog | 6 |
| Student_024 | Tammi | f | 10 | banana | hockey | no | Lion | 3 |
| Student_025 | Elliot | m | 11 | cherry | soccer | no | Tiger | 2 |
| Student_026 | Lilian | f | 12 | pear | table tennis | no | Bobcat | 2 |
| Student_027 | Noah | m | 10 | apple | tennis | yes | Whale | 1 |
| Student_028 | Jeanie | f | 11 | strawberry | ballet | yes | Unicorn | 6 |
| Student_029 | Louis | m | 12 | orange | swimming | no | Wolf | 4 |
| Student_030 | Therese | f | 13 | blood orange | karate | yes | Turtle | 5 |
Here we want to change unspecific answers –> make them definitive yes OR no
We will look at the grep() function, which allows you to ‘grab’ a pattern and apply changes to all rows that follow it.
This is more generalized and is better for columns that have few options already (male/female, yes/no).
casewhen() gives you more freedom if the row input is super diverse and you have to correct a lot.
grep() pattern search is best combined with ’*’ as a ‘wild card’ ( = find stuff related to my pattern)
df $ colname [ grep (“old_pattern*.”, df$colname) ] <- “new_pattern”
dirty_csv$Do_you_like_Math[grep("yes*", dirty_csv$Do_you_like_Math)] <- "yes"
dirty_csv$Do_you_like_Math[grep("no*", dirty_csv$Do_you_like_Math)] <- "no"
kable(dirty_csv, align = 'ccccccccccccccccc')
| Student_ID | Name | Gender | Age | Fav_Fruit | Fav_Sport | Do_you_like_Math | Fav_Animal | Last_Number_ZIP |
|---|---|---|---|---|---|---|---|---|
| Student_001 | Jimmy | m | 10 | apple | football | yes | Bunny | 1 |
| Student_002 | Mathilde | f | 11 | coconut | soccer | no | NA | 2 |
| Student_003 | Laura | f | 10 | dragon fruit | ballet | yes | Horse | 3 |
| Student_004 | Selma | m | 9 | lemon | swimming | yes | NA | 4 |
| Student_005 | Elisa | f | 10 | apple | ballet | no | Horse | 1 |
| Student_006 | Andreas | m | 10 | banana | tennis | no | NA | 2 |
| Student_007 | Tom | m | 9 | pear | swimming | no | Tiger | 2 |
| Student_008 | Lily | f | 10 | orange | ballet | yes | NA | 4 |
| Student_009 | Max | m | 11 | cherry | karate | yes | NA | 6 |
| Student_010 | Priscilla | f | 12 | strawberry | soccer | yes | Dog | 8 |
| Student_011 | Thomas | m | 13 | melon | soccer | no | NA | 8 |
| Student_021 | Liam | m | 10 | apple | soccer | yes | Cat | 6 |
| Student_022 | Leanna | f | 11 | banana | basketball | no | Cat | 7 |
| Student_023 | Jodi | f | 12 | orange | tennis | yes | Dog | 6 |
| Student_024 | Tammi | f | 10 | banana | hockey | no | Lion | 3 |
| Student_025 | Elliot | m | 11 | cherry | soccer | no | Tiger | 2 |
| Student_026 | Lilian | f | 12 | pear | table tennis | no | Bobcat | 2 |
| Student_027 | Noah | m | 10 | apple | tennis | yes | Whale | 1 |
| Student_028 | Jeanie | f | 11 | strawberry | ballet | yes | Unicorn | 6 |
| Student_029 | Louis | m | 12 | orange | swimming | no | Wolf | 4 |
| Student_030 | Therese | f | 13 | blood orange | karate | yes | Turtle | 5 |
There are different joins you can perform.
First of all, let’s rename our prior data frame and load the other table we want to join with!
# let's rename our first data frame --> Student_Data1
Student_Data1 <- dirty_csv
# let's add the second data set
Student_Data2 <- read.csv('https://raw.githubusercontent.com/tlobnow/Resources/main/R_Introduction/dirty_files/Dirty_Data2.csv', na.strings = c('', ' ', 'NA'))
kable(Student_Data2, align = 'ccccccccccccccccc')
| Student_ID | Surname | Fav_ANIMAL | Fav_Meal | Fav_City | Longitude | Latitude | Grade_English | Grade_Math | Grade_Biology |
|---|---|---|---|---|---|---|---|---|---|
| Student_001 | Bubblegum | NA | Spaghetti | Berlin | 13.83390 | 52.26940 | 1 | 1 | 1 |
| Student_002 | Burpland | Dog | Salad | Frankfurt | 13.67951 | 51.67676 | 3 | 3 | 2 |
| Student_003 | Applecake | NA | Mac and Cheese | Berlin | 14.08280 | 52.70000 | 2 | 1 | 2 |
| Student_004 | Frumpleton | Dog | Spaghetti | Munich | 13.44080 | 53.04970 | 4 | 2 | 1 |
| Student_005 | Crumbler | NA | Nuggets | New York | 13.73250 | 52.97890 | 3 | 4 | 3 |
| Student_006 | Bananabread | Cat | Pizza | Amsterdam | 14.06720 | 52.40190 | 1 | 2 | 3 |
| Student_007 | Phelangy | NA | Nuggets | Berlin | 13.73360 | 52.97830 | 1 | 3 | 2 |
| Student_008 | Malone | Meerkat | Salad | Sacramento | 13.67951 | 51.67676 | 2 | 2 | 1 |
| Student_009 | Star | Cat | Pizza | Rome | 13.53530 | 52.89690 | 2 | 1 | 2 |
| Student_010 | Squarepants | NA | Spaghetti | Berlin | 14.08170 | 52.76670 | 1 | 1 | 3 |
| Student_011 | Bumblebee | Whale | Spaghetti | Berlin | 14.08330 | 52.69670 | 2 | 2 | 3 |
| Student_012 | Ellison | Tiger | Salad | Rio de Janeiro | 13.95330 | 53.14220 | 3 | 1 | 3 |
| Student_013 | Steele | Lion | Nuggets | Tokyo | 14.09220 | 52.21250 | 4 | 1 | 2 |
| Student_014 | Reese | Dog | Spaghetti | Warsaw | 13.77720 | 52.37440 | 1 | 2 | 1 |
| Student_015 | Werner | Horse | Nuggets | Berlin | 14.09390 | 53.07060 | 1 | 2 | 1 |
| Student_016 | Odom | Cat | Pizza | Munich | 13.84734 | 51.85354 | 2 | 3 | 2 |
| Student_017 | Ray | Cat | Nuggets | Frankfurt | 13.95330 | 53.14220 | 1 | 3 | 1 |
| Student_018 | Hammond | Horse | Salad | New York | 14.08330 | 52.69670 | 3 | 1 | 2 |
| Student_019 | Kent | Dog | Pizza | New York | 13.46390 | 51.86250 | 1 | 2 | 3 |
| Student_020 | McKenzie | Parrot | Pizza | Berlin | 13.48640 | 53.12750 | 4 | 4 | 4 |
For simplicity, this table is clean already, we just want to look what happens when you join with different functions.
left_join() = join matching rows from b to a (includes all rows in a)
right_join() = join matching rows from a to b (includes all rows in b)
inner_join() = join data and retain only rows in both sets (includes all rows in a AND b)
full_join() = join data, retain all values, all rows (includes all rows in a OR b)
library(knitr)
# left join
Student_Data <- left_join(Student_Data1, Student_Data2) %>% arrange(Student_ID)
## Joining, by = "Student_ID"
kable(Student_Data, align = 'ccccccccccccccccc', caption = 'left join') # we want all inputs (17 columns) to be central = 'c'
| Student_ID | Name | Gender | Age | Fav_Fruit | Fav_Sport | Do_you_like_Math | Fav_Animal | Last_Number_ZIP | Surname | Fav_ANIMAL | Fav_Meal | Fav_City | Longitude | Latitude | Grade_English | Grade_Math | Grade_Biology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Student_001 | Jimmy | m | 10 | apple | football | yes | Bunny | 1 | Bubblegum | NA | Spaghetti | Berlin | 13.83390 | 52.26940 | 1 | 1 | 1 |
| Student_002 | Mathilde | f | 11 | coconut | soccer | no | NA | 2 | Burpland | Dog | Salad | Frankfurt | 13.67951 | 51.67676 | 3 | 3 | 2 |
| Student_003 | Laura | f | 10 | dragon fruit | ballet | yes | Horse | 3 | Applecake | NA | Mac and Cheese | Berlin | 14.08280 | 52.70000 | 2 | 1 | 2 |
| Student_004 | Selma | m | 9 | lemon | swimming | yes | NA | 4 | Frumpleton | Dog | Spaghetti | Munich | 13.44080 | 53.04970 | 4 | 2 | 1 |
| Student_005 | Elisa | f | 10 | apple | ballet | no | Horse | 1 | Crumbler | NA | Nuggets | New York | 13.73250 | 52.97890 | 3 | 4 | 3 |
| Student_006 | Andreas | m | 10 | banana | tennis | no | NA | 2 | Bananabread | Cat | Pizza | Amsterdam | 14.06720 | 52.40190 | 1 | 2 | 3 |
| Student_007 | Tom | m | 9 | pear | swimming | no | Tiger | 2 | Phelangy | NA | Nuggets | Berlin | 13.73360 | 52.97830 | 1 | 3 | 2 |
| Student_008 | Lily | f | 10 | orange | ballet | yes | NA | 4 | Malone | Meerkat | Salad | Sacramento | 13.67951 | 51.67676 | 2 | 2 | 1 |
| Student_009 | Max | m | 11 | cherry | karate | yes | NA | 6 | Star | Cat | Pizza | Rome | 13.53530 | 52.89690 | 2 | 1 | 2 |
| Student_010 | Priscilla | f | 12 | strawberry | soccer | yes | Dog | 8 | Squarepants | NA | Spaghetti | Berlin | 14.08170 | 52.76670 | 1 | 1 | 3 |
| Student_011 | Thomas | m | 13 | melon | soccer | no | NA | 8 | Bumblebee | Whale | Spaghetti | Berlin | 14.08330 | 52.69670 | 2 | 2 | 3 |
| Student_021 | Liam | m | 10 | apple | soccer | yes | Cat | 6 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_022 | Leanna | f | 11 | banana | basketball | no | Cat | 7 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_023 | Jodi | f | 12 | orange | tennis | yes | Dog | 6 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_024 | Tammi | f | 10 | banana | hockey | no | Lion | 3 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_025 | Elliot | m | 11 | cherry | soccer | no | Tiger | 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_026 | Lilian | f | 12 | pear | table tennis | no | Bobcat | 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_027 | Noah | m | 10 | apple | tennis | yes | Whale | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_028 | Jeanie | f | 11 | strawberry | ballet | yes | Unicorn | 6 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_029 | Louis | m | 12 | orange | swimming | no | Wolf | 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_030 | Therese | f | 13 | blood orange | karate | yes | Turtle | 5 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
vis_miss(Student_Data)
# right join
Student_Data <- right_join(Student_Data1, Student_Data2) %>% arrange(Student_ID)
## Joining, by = "Student_ID"
kable(Student_Data, align = 'ccccccccccccccccc', caption = 'right join')
| Student_ID | Name | Gender | Age | Fav_Fruit | Fav_Sport | Do_you_like_Math | Fav_Animal | Last_Number_ZIP | Surname | Fav_ANIMAL | Fav_Meal | Fav_City | Longitude | Latitude | Grade_English | Grade_Math | Grade_Biology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Student_001 | Jimmy | m | 10 | apple | football | yes | Bunny | 1 | Bubblegum | NA | Spaghetti | Berlin | 13.83390 | 52.26940 | 1 | 1 | 1 |
| Student_002 | Mathilde | f | 11 | coconut | soccer | no | NA | 2 | Burpland | Dog | Salad | Frankfurt | 13.67951 | 51.67676 | 3 | 3 | 2 |
| Student_003 | Laura | f | 10 | dragon fruit | ballet | yes | Horse | 3 | Applecake | NA | Mac and Cheese | Berlin | 14.08280 | 52.70000 | 2 | 1 | 2 |
| Student_004 | Selma | m | 9 | lemon | swimming | yes | NA | 4 | Frumpleton | Dog | Spaghetti | Munich | 13.44080 | 53.04970 | 4 | 2 | 1 |
| Student_005 | Elisa | f | 10 | apple | ballet | no | Horse | 1 | Crumbler | NA | Nuggets | New York | 13.73250 | 52.97890 | 3 | 4 | 3 |
| Student_006 | Andreas | m | 10 | banana | tennis | no | NA | 2 | Bananabread | Cat | Pizza | Amsterdam | 14.06720 | 52.40190 | 1 | 2 | 3 |
| Student_007 | Tom | m | 9 | pear | swimming | no | Tiger | 2 | Phelangy | NA | Nuggets | Berlin | 13.73360 | 52.97830 | 1 | 3 | 2 |
| Student_008 | Lily | f | 10 | orange | ballet | yes | NA | 4 | Malone | Meerkat | Salad | Sacramento | 13.67951 | 51.67676 | 2 | 2 | 1 |
| Student_009 | Max | m | 11 | cherry | karate | yes | NA | 6 | Star | Cat | Pizza | Rome | 13.53530 | 52.89690 | 2 | 1 | 2 |
| Student_010 | Priscilla | f | 12 | strawberry | soccer | yes | Dog | 8 | Squarepants | NA | Spaghetti | Berlin | 14.08170 | 52.76670 | 1 | 1 | 3 |
| Student_011 | Thomas | m | 13 | melon | soccer | no | NA | 8 | Bumblebee | Whale | Spaghetti | Berlin | 14.08330 | 52.69670 | 2 | 2 | 3 |
| Student_012 | NA | NA | NA | NA | NA | NA | NA | NA | Ellison | Tiger | Salad | Rio de Janeiro | 13.95330 | 53.14220 | 3 | 1 | 3 |
| Student_013 | NA | NA | NA | NA | NA | NA | NA | NA | Steele | Lion | Nuggets | Tokyo | 14.09220 | 52.21250 | 4 | 1 | 2 |
| Student_014 | NA | NA | NA | NA | NA | NA | NA | NA | Reese | Dog | Spaghetti | Warsaw | 13.77720 | 52.37440 | 1 | 2 | 1 |
| Student_015 | NA | NA | NA | NA | NA | NA | NA | NA | Werner | Horse | Nuggets | Berlin | 14.09390 | 53.07060 | 1 | 2 | 1 |
| Student_016 | NA | NA | NA | NA | NA | NA | NA | NA | Odom | Cat | Pizza | Munich | 13.84734 | 51.85354 | 2 | 3 | 2 |
| Student_017 | NA | NA | NA | NA | NA | NA | NA | NA | Ray | Cat | Nuggets | Frankfurt | 13.95330 | 53.14220 | 1 | 3 | 1 |
| Student_018 | NA | NA | NA | NA | NA | NA | NA | NA | Hammond | Horse | Salad | New York | 14.08330 | 52.69670 | 3 | 1 | 2 |
| Student_019 | NA | NA | NA | NA | NA | NA | NA | NA | Kent | Dog | Pizza | New York | 13.46390 | 51.86250 | 1 | 2 | 3 |
| Student_020 | NA | NA | NA | NA | NA | NA | NA | NA | McKenzie | Parrot | Pizza | Berlin | 13.48640 | 53.12750 | 4 | 4 | 4 |
vis_miss(Student_Data)
# inner join
Student_Data <- inner_join(Student_Data1, Student_Data2) %>% arrange(Student_ID)
## Joining, by = "Student_ID"
kable(Student_Data, align = 'ccccccccccccccccc', caption = 'inner join')
| Student_ID | Name | Gender | Age | Fav_Fruit | Fav_Sport | Do_you_like_Math | Fav_Animal | Last_Number_ZIP | Surname | Fav_ANIMAL | Fav_Meal | Fav_City | Longitude | Latitude | Grade_English | Grade_Math | Grade_Biology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Student_001 | Jimmy | m | 10 | apple | football | yes | Bunny | 1 | Bubblegum | NA | Spaghetti | Berlin | 13.83390 | 52.26940 | 1 | 1 | 1 |
| Student_002 | Mathilde | f | 11 | coconut | soccer | no | NA | 2 | Burpland | Dog | Salad | Frankfurt | 13.67951 | 51.67676 | 3 | 3 | 2 |
| Student_003 | Laura | f | 10 | dragon fruit | ballet | yes | Horse | 3 | Applecake | NA | Mac and Cheese | Berlin | 14.08280 | 52.70000 | 2 | 1 | 2 |
| Student_004 | Selma | m | 9 | lemon | swimming | yes | NA | 4 | Frumpleton | Dog | Spaghetti | Munich | 13.44080 | 53.04970 | 4 | 2 | 1 |
| Student_005 | Elisa | f | 10 | apple | ballet | no | Horse | 1 | Crumbler | NA | Nuggets | New York | 13.73250 | 52.97890 | 3 | 4 | 3 |
| Student_006 | Andreas | m | 10 | banana | tennis | no | NA | 2 | Bananabread | Cat | Pizza | Amsterdam | 14.06720 | 52.40190 | 1 | 2 | 3 |
| Student_007 | Tom | m | 9 | pear | swimming | no | Tiger | 2 | Phelangy | NA | Nuggets | Berlin | 13.73360 | 52.97830 | 1 | 3 | 2 |
| Student_008 | Lily | f | 10 | orange | ballet | yes | NA | 4 | Malone | Meerkat | Salad | Sacramento | 13.67951 | 51.67676 | 2 | 2 | 1 |
| Student_009 | Max | m | 11 | cherry | karate | yes | NA | 6 | Star | Cat | Pizza | Rome | 13.53530 | 52.89690 | 2 | 1 | 2 |
| Student_010 | Priscilla | f | 12 | strawberry | soccer | yes | Dog | 8 | Squarepants | NA | Spaghetti | Berlin | 14.08170 | 52.76670 | 1 | 1 | 3 |
| Student_011 | Thomas | m | 13 | melon | soccer | no | NA | 8 | Bumblebee | Whale | Spaghetti | Berlin | 14.08330 | 52.69670 | 2 | 2 | 3 |
vis_miss(Student_Data)
# full join
Student_Data <- full_join(Student_Data1, Student_Data2) %>% arrange(Student_ID)
## Joining, by = "Student_ID"
kable(Student_Data, align = 'ccccccccccccccccc', caption = 'full join')
| Student_ID | Name | Gender | Age | Fav_Fruit | Fav_Sport | Do_you_like_Math | Fav_Animal | Last_Number_ZIP | Surname | Fav_ANIMAL | Fav_Meal | Fav_City | Longitude | Latitude | Grade_English | Grade_Math | Grade_Biology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Student_001 | Jimmy | m | 10 | apple | football | yes | Bunny | 1 | Bubblegum | NA | Spaghetti | Berlin | 13.83390 | 52.26940 | 1 | 1 | 1 |
| Student_002 | Mathilde | f | 11 | coconut | soccer | no | NA | 2 | Burpland | Dog | Salad | Frankfurt | 13.67951 | 51.67676 | 3 | 3 | 2 |
| Student_003 | Laura | f | 10 | dragon fruit | ballet | yes | Horse | 3 | Applecake | NA | Mac and Cheese | Berlin | 14.08280 | 52.70000 | 2 | 1 | 2 |
| Student_004 | Selma | m | 9 | lemon | swimming | yes | NA | 4 | Frumpleton | Dog | Spaghetti | Munich | 13.44080 | 53.04970 | 4 | 2 | 1 |
| Student_005 | Elisa | f | 10 | apple | ballet | no | Horse | 1 | Crumbler | NA | Nuggets | New York | 13.73250 | 52.97890 | 3 | 4 | 3 |
| Student_006 | Andreas | m | 10 | banana | tennis | no | NA | 2 | Bananabread | Cat | Pizza | Amsterdam | 14.06720 | 52.40190 | 1 | 2 | 3 |
| Student_007 | Tom | m | 9 | pear | swimming | no | Tiger | 2 | Phelangy | NA | Nuggets | Berlin | 13.73360 | 52.97830 | 1 | 3 | 2 |
| Student_008 | Lily | f | 10 | orange | ballet | yes | NA | 4 | Malone | Meerkat | Salad | Sacramento | 13.67951 | 51.67676 | 2 | 2 | 1 |
| Student_009 | Max | m | 11 | cherry | karate | yes | NA | 6 | Star | Cat | Pizza | Rome | 13.53530 | 52.89690 | 2 | 1 | 2 |
| Student_010 | Priscilla | f | 12 | strawberry | soccer | yes | Dog | 8 | Squarepants | NA | Spaghetti | Berlin | 14.08170 | 52.76670 | 1 | 1 | 3 |
| Student_011 | Thomas | m | 13 | melon | soccer | no | NA | 8 | Bumblebee | Whale | Spaghetti | Berlin | 14.08330 | 52.69670 | 2 | 2 | 3 |
| Student_012 | NA | NA | NA | NA | NA | NA | NA | NA | Ellison | Tiger | Salad | Rio de Janeiro | 13.95330 | 53.14220 | 3 | 1 | 3 |
| Student_013 | NA | NA | NA | NA | NA | NA | NA | NA | Steele | Lion | Nuggets | Tokyo | 14.09220 | 52.21250 | 4 | 1 | 2 |
| Student_014 | NA | NA | NA | NA | NA | NA | NA | NA | Reese | Dog | Spaghetti | Warsaw | 13.77720 | 52.37440 | 1 | 2 | 1 |
| Student_015 | NA | NA | NA | NA | NA | NA | NA | NA | Werner | Horse | Nuggets | Berlin | 14.09390 | 53.07060 | 1 | 2 | 1 |
| Student_016 | NA | NA | NA | NA | NA | NA | NA | NA | Odom | Cat | Pizza | Munich | 13.84734 | 51.85354 | 2 | 3 | 2 |
| Student_017 | NA | NA | NA | NA | NA | NA | NA | NA | Ray | Cat | Nuggets | Frankfurt | 13.95330 | 53.14220 | 1 | 3 | 1 |
| Student_018 | NA | NA | NA | NA | NA | NA | NA | NA | Hammond | Horse | Salad | New York | 14.08330 | 52.69670 | 3 | 1 | 2 |
| Student_019 | NA | NA | NA | NA | NA | NA | NA | NA | Kent | Dog | Pizza | New York | 13.46390 | 51.86250 | 1 | 2 | 3 |
| Student_020 | NA | NA | NA | NA | NA | NA | NA | NA | McKenzie | Parrot | Pizza | Berlin | 13.48640 | 53.12750 | 4 | 4 | 4 |
| Student_021 | Liam | m | 10 | apple | soccer | yes | Cat | 6 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_022 | Leanna | f | 11 | banana | basketball | no | Cat | 7 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_023 | Jodi | f | 12 | orange | tennis | yes | Dog | 6 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_024 | Tammi | f | 10 | banana | hockey | no | Lion | 3 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_025 | Elliot | m | 11 | cherry | soccer | no | Tiger | 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_026 | Lilian | f | 12 | pear | table tennis | no | Bobcat | 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_027 | Noah | m | 10 | apple | tennis | yes | Whale | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_028 | Jeanie | f | 11 | strawberry | ballet | yes | Unicorn | 6 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_029 | Louis | m | 12 | orange | swimming | no | Wolf | 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_030 | Therese | f | 13 | blood orange | karate | yes | Turtle | 5 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
vis_miss(Student_Data)
There are two csv files that contain the remaining, missing data:
Student_Data3
Student_Data4
Student_Data3 <- read.csv('https://raw.githubusercontent.com/tlobnow/Resources/main/R_Introduction/dirty_files/Dirty_Data3.csv', na.strings = c('', ' ', 'NA'))
Student_Data4 <- read.csv('https://raw.githubusercontent.com/tlobnow/Resources/main/R_Introduction/dirty_files/Dirty_Data4.csv', na.strings = c('', ' ', 'NA'))
# let's use full join
Student_Data <- full_join(Student_Data, Student_Data3) %>% arrange(Student_ID)
## Joining, by = c("Student_ID", "Name", "Gender", "Age", "Fav_Fruit", "Fav_Sport",
## "Do_you_like_Math", "Last_Number_ZIP")
Student_Data <- full_join(Student_Data, Student_Data4) %>% arrange(Student_ID)
## Joining, by = c("Student_ID", "Surname", "Fav_ANIMAL", "Fav_Meal", "Fav_City",
## "Longitude", "Latitude", "Grade_English", "Grade_Math", "Grade_Biology")
rm(Student_Data1, Student_Data2, Student_Data3, Student_Data4)
kable(Student_Data, align = 'ccccccccccccccccc', caption = 'full join')
| Student_ID | Name | Gender | Age | Fav_Fruit | Fav_Sport | Do_you_like_Math | Fav_Animal | Last_Number_ZIP | Surname | Fav_ANIMAL | Fav_Meal | Fav_City | Longitude | Latitude | Grade_English | Grade_Math | Grade_Biology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Student_001 | Jimmy | m | 10 | apple | football | yes | Bunny | 1 | Bubblegum | NA | Spaghetti | Berlin | 13.83390 | 52.26940 | 1 | 1 | 1 |
| Student_002 | Mathilde | f | 11 | coconut | soccer | no | NA | 2 | Burpland | Dog | Salad | Frankfurt | 13.67951 | 51.67676 | 3 | 3 | 2 |
| Student_003 | Laura | f | 10 | dragon fruit | ballet | yes | Horse | 3 | Applecake | NA | Mac and Cheese | Berlin | 14.08280 | 52.70000 | 2 | 1 | 2 |
| Student_004 | Selma | m | 9 | lemon | swimming | yes | NA | 4 | Frumpleton | Dog | Spaghetti | Munich | 13.44080 | 53.04970 | 4 | 2 | 1 |
| Student_005 | Elisa | f | 10 | apple | ballet | no | Horse | 1 | Crumbler | NA | Nuggets | New York | 13.73250 | 52.97890 | 3 | 4 | 3 |
| Student_006 | Andreas | m | 10 | banana | tennis | no | NA | 2 | Bananabread | Cat | Pizza | Amsterdam | 14.06720 | 52.40190 | 1 | 2 | 3 |
| Student_007 | Tom | m | 9 | pear | swimming | no | Tiger | 2 | Phelangy | NA | Nuggets | Berlin | 13.73360 | 52.97830 | 1 | 3 | 2 |
| Student_008 | Lily | f | 10 | orange | ballet | yes | NA | 4 | Malone | Meerkat | Salad | Sacramento | 13.67951 | 51.67676 | 2 | 2 | 1 |
| Student_009 | Max | m | 11 | cherry | karate | yes | NA | 6 | Star | Cat | Pizza | Rome | 13.53530 | 52.89690 | 2 | 1 | 2 |
| Student_010 | Priscilla | f | 12 | strawberry | soccer | yes | Dog | 8 | Squarepants | NA | Spaghetti | Berlin | 14.08170 | 52.76670 | 1 | 1 | 3 |
| Student_011 | Thomas | m | 13 | melon | soccer | no | NA | 8 | Bumblebee | Whale | Spaghetti | Berlin | 14.08330 | 52.69670 | 2 | 2 | 3 |
| Student_012 | NA | NA | NA | NA | NA | NA | NA | NA | Ellison | Tiger | Salad | Rio de Janeiro | 13.95330 | 53.14220 | 3 | 1 | 3 |
| Student_012 | Susie | f | 10 | pear | baseball | yes | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_013 | NA | NA | NA | NA | NA | NA | NA | NA | Steele | Lion | Nuggets | Tokyo | 14.09220 | 52.21250 | 4 | 1 | 2 |
| Student_013 | Frank | m | 11 | apple | basketball | yes | NA | 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_014 | NA | NA | NA | NA | NA | NA | NA | NA | Reese | Dog | Spaghetti | Warsaw | 13.77720 | 52.37440 | 1 | 2 | 1 |
| Student_014 | Francis | f | 12 | banana | tennis | no | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_015 | NA | NA | NA | NA | NA | NA | NA | NA | Werner | Horse | Nuggets | Berlin | 14.09390 | 53.07060 | 1 | 2 | 1 |
| Student_015 | Rene | m | 11 | cherry | soccer | no | NA | 3 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_016 | NA | NA | NA | NA | NA | NA | NA | NA | Odom | Cat | Pizza | Munich | 13.84734 | 51.85354 | 2 | 3 | 2 |
| Student_016 | Linus | m | 11 | pear | soccer | yes | NA | 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_017 | NA | NA | NA | NA | NA | NA | NA | NA | Ray | Cat | Nuggets | Frankfurt | 13.95330 | 53.14220 | 1 | 3 | 1 |
| Student_017 | Findus | m | 10 | banana | hockey | no | NA | 7 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_018 | NA | NA | NA | NA | NA | NA | NA | NA | Hammond | Horse | Salad | New York | 14.08330 | 52.69670 | 3 | 1 | 2 |
| Student_018 | Lena | f | 11 | strawberry | dancing | yes | NA | 6 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_019 | NA | NA | NA | NA | NA | NA | NA | NA | Kent | Dog | Pizza | New York | 13.46390 | 51.86250 | 1 | 2 | 3 |
| Student_019 | Lisa | f | 12 | coconut | soccer | no | NA | 8 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_020 | NA | NA | NA | NA | NA | NA | NA | NA | McKenzie | Parrot | Pizza | Berlin | 13.48640 | 53.12750 | 4 | 4 | 4 |
| Student_020 | Paul | m | 10 | kiwi | ballet | no | NA | 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_021 | Liam | m | 10 | apple | soccer | yes | Cat | 6 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_021 | NA | NA | NA | NA | NA | NA | NA | NA | Smith | NA | Tagliatelle | Berlin | 13.73360 | 52.97830 | 1 | 2 | 1 |
| Student_022 | Leanna | f | 11 | banana | basketball | no | Cat | 7 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_022 | NA | NA | NA | NA | NA | NA | NA | NA | Jones | NA | Pizza | London | 13.89500 | 52.95810 | 2 | 2 | 2 |
| Student_023 | Jodi | f | 12 | orange | tennis | yes | Dog | 6 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_023 | NA | NA | NA | NA | NA | NA | NA | NA | Meyer | NA | Nuggets | Frankfurt | 13.58830 | 52.21560 | 3 | 3 | 2 |
| Student_024 | Tammi | f | 10 | banana | hockey | no | Lion | 3 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_024 | NA | NA | NA | NA | NA | NA | NA | NA | Otto | NA | Chocolate Cake | New York | 14.14140 | 53.07830 | 1 | 3 | 3 |
| Student_025 | Elliot | m | 11 | cherry | soccer | no | Tiger | 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_025 | NA | NA | NA | NA | NA | NA | NA | NA | Petterson | NA | Spaghetti | Shanghai | 14.11610 | 52.79140 | 2 | 1 | 4 |
| Student_026 | Lilian | f | 12 | pear | table tennis | no | Bobcat | 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_026 | NA | NA | NA | NA | NA | NA | NA | NA | Ingman | NA | Salad | New York | 13.89500 | 52.95810 | 2 | 1 | 1 |
| Student_027 | Noah | m | 10 | apple | tennis | yes | Whale | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_027 | NA | NA | NA | NA | NA | NA | NA | NA | Stine | NA | Chicken | Stockholm | 13.44080 | 53.04970 | 3 | 3 | 1 |
| Student_028 | Jeanie | f | 11 | strawberry | ballet | yes | Unicorn | 6 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_028 | NA | NA | NA | NA | NA | NA | NA | NA | Beckett | NA | Curry | Stavanger | 11.60690 | 53.94049 | 3 | 2 | 2 |
| Student_029 | Louis | m | 12 | orange | swimming | no | Wolf | 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_029 | NA | NA | NA | NA | NA | NA | NA | NA | Blyton | NA | Pizza | Frankfurt | 13.60250 | 52.22250 | 1 | 2 | 3 |
| Student_030 | Therese | f | 13 | blood orange | karate | yes | Turtle | 5 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Student_030 | NA | NA | NA | NA | NA | NA | NA | NA | Morgan | NA | Spaghetti | Berlin | 13.84734 | 51.85354 | 1 | 1 | 4 |
vis_miss(Student_Data)
Let’s see how we can get rid of the duplicate Student_IDs and fill the missing data
Student_Data <- Student_Data %>% arrange(Student_ID) %>% group_by(Student_ID) %>% fill(c(everything()), .direction = "downup") %>% ungroup() %>% distinct(Student_ID, .keep_all = T)
Let’s look at the the table.
Do we still have missing data?
vis_miss(Student_Data)
Fav_Animal is the correct name of the column, so we want to end up with that column name and ALL the data in that column.
Currently, we have two partially filled columns - how can we solve that?
Any blanks left?
Temp_df <- Student_Data %>% select(Student_ID, Fav_Animal, Fav_ANIMAL) %>%
pivot_longer(names_to = "Temp", values_to = "Fav_Animal", cols = c(Fav_Animal, Fav_ANIMAL)) %>%
arrange(Student_ID) %>%
group_by(Student_ID) %>%
fill(c(everything()), .direction = "downup") %>%
ungroup() %>%
select(Student_ID, Fav_Animal) %>%
distinct(Student_ID, .keep_all = T)
kable(Temp_df, align = 'ccc', caption = 'Temporary Table')
| Student_ID | Fav_Animal |
|---|---|
| Student_001 | Bunny |
| Student_002 | Dog |
| Student_003 | Horse |
| Student_004 | Dog |
| Student_005 | Horse |
| Student_006 | Cat |
| Student_007 | Tiger |
| Student_008 | Meerkat |
| Student_009 | Cat |
| Student_010 | Dog |
| Student_011 | Whale |
| Student_012 | Tiger |
| Student_013 | Lion |
| Student_014 | Dog |
| Student_015 | Horse |
| Student_016 | Cat |
| Student_017 | Cat |
| Student_018 | Horse |
| Student_019 | Dog |
| Student_020 | Parrot |
| Student_021 | Cat |
| Student_022 | Cat |
| Student_023 | Dog |
| Student_024 | Lion |
| Student_025 | Tiger |
| Student_026 | Bobcat |
| Student_027 | Whale |
| Student_028 | Unicorn |
| Student_029 | Wolf |
| Student_030 | Turtle |
Student_Data <- full_join(Student_Data, Temp_df) %>%
select(-c(Fav_ANIMAL)) %>%
arrange(Student_ID) %>%
group_by(Student_ID) %>%
fill(c(everything()), .direction = "downup") %>%
ungroup() %>%
distinct(Student_ID, .keep_all = T)
## Joining, by = c("Student_ID", "Fav_Animal")
rm(Temp_df)
vis_miss(Student_Data)
# Btw, you can combine the case_when() function with different conditionals:
# < smaller than
# > bigger than
# <= smaller or equal to
# >= bigger or equal to
# == equal to
# != not equal to
# & logical 'AND' condition
# | logical 'OR' condition
Student_Data <- Student_Data %>% mutate(
Performance_Mean = (Grade_English + Grade_Math + Grade_Biology)/3,
Student_Eval = case_when(Performance_Mean < 1.5 ~ 'excellent',
Performance_Mean >= 1.5 & Performance_Mean < 2 ~ 'very good',
Performance_Mean >= 2 & Performance_Mean < 3 ~ 'good',
Performance_Mean >= 3 & Performance_Mean < 4 ~ 'average',
Performance_Mean >= 4 ~ 'bad'))
Data Visualization will be done using two main packages ‘ggplot2’ and ‘leaflet’.
ggplot2 is used for general visualization like dot plots, bar plots, histograms, linear regression, ..
More info on violin plots you can find here
A more in-depth introduction to data visualization with ggplot is here
We will start by looking at the ‘iris’ dataset provided by R
It has 5 columns and 150 rows
glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
# Let's visualize with a dot plot first:
# compare Sepal.Length and Sepal.Width, colored by Species
# the regression lines are per species, no standard deviation ribbon
iris %>%
ggplot(aes(Sepal.Length, Sepal.Width, col = Species)) +
geom_point() +
geom_smooth(method = 'lm', se = F)
## `geom_smooth()` using formula 'y ~ x'
# plot(iris$Sepal.Length, iris$Sepal.Width) # base functions are always an alternative!
# violin plot & boxplot together
iris %>%
ggplot(aes(Species, Petal.Width)) +
geom_violin(aes(col = Species, fill = Species, alpha = 0.3)) +
geom_boxplot(aes(alpha = 0)) +
ggtitle('Iris Boxplot + Violin plot') +
xlab('Species') +
ylab('Petal Width')
# Histogram
iris %>%
ggplot(aes(Sepal.Width)) +
geom_histogram(bins = 12)
# hist(iris$Sepal.Width) # base functions are always an alternative!
There are great color palettes already available in R, here is a summary of the most useful ones.
I tend to use the RColorBrewer package a lot, as it offers different kinds of palettes:
To check out the different options of the package, type display.brewer.all() after loading library(RColorBrewer)
To use a palette, you have to add the brewer palette to your graphic (e.g. in a ggplot), as follows:
library(RColorBrewer) #display.brewer.all()
iris %>%
ggplot(aes(Sepal.Length, Sepal.Width, col = Species)) +
geom_point() +
scale_color_brewer(palette = 'Dark2',
direction = 1) # direction = -1 would be inverted color order
Just for the sake of showing it to you:
You can also set up your own palettes
I did this to get the HI gradient we have used for quite some time in publications.
# Load Palette ####
r <- c(0, 64, 128, 179, 217, 255)
g <- c(0, 12, 25, 25, 12, 0)
b <- c(255, 249, 243, 191, 95, 0)
myPal <- function (n, name = c("myPal.colors"))
{
myPal.colors = rgb(r,g,b,maxColorValue = 255)
name = match.arg(name)
orig = eval(parse(text = name))
rgb = t(col2rgb(orig))
temp = matrix(NA, ncol = 3, nrow = n)
x = seq(0, 1, , length(orig))
xg = seq(0, 1, , n)
for (k in 1:3) {
hold = spline(x, rgb[, k], n = n)$y
hold[hold < 0] = 0
hold[hold > 255] = 255
temp[, k] = round(hold)
}
palette = rgb(temp[, 1], temp[, 2], temp[, 3], maxColorValue = 255)
palette
}
Anscombe (1973) has a nice example where he uses a constructed dataset to emphasize the importance of using graphs in statistical analysis.
There are 8 variables, representing four pairings of an outcome and a predictor.
All sets have 11 observations, the same mean of x (9) and y (7.5), the same fitted regression line (y = 3 + 0.5 x), the same regression and residual sum of squares and therefore the same multiple R-squared of 0.67.
If you want to learn more about Correlation and Regression, there is a great website with different datasets and problem sets based on different regressions here.
If you are interested in the Datacamp course on Correlation and Regression, there is a free course available here
data <- datasets::anscombe # already a base R data set!
# we have to prepare this data set a tiny bit to separate the data into sets
Anscombe <- data.frame(
set = rep(1:4, each = 11),
x = unlist(data[ ,c(1:4)]),
y = unlist(data[ ,c(5:8)])
)
rownames(Anscombe) <- NULL
head(Anscombe)
## set x y
## 1 1 10 8.04
## 2 1 8 6.95
## 3 1 13 7.58
## 4 1 9 8.81
## 5 1 11 8.33
## 6 1 14 9.96
Anscombe %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
facet_wrap(~set) +
theme_gray() # you can adjust themes as well
Anscombe %>%
group_by(set) %>%
summarize(N = n(),
mean_x = mean(x),
sd_x = sd(x),
mean_y = mean(y),
sd_y = sd(y),
cor_between_x_and_y = cor(x,y))
## # A tibble: 4 × 7
## set N mean_x sd_x mean_y sd_y cor_between_x_and_y
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 11 9 3.32 7.50 2.03 0.816
## 2 2 11 9 3.32 7.50 2.03 0.816
## 3 3 11 9 3.32 7.5 2.03 0.816
## 4 4 11 9 3.32 7.50 2.03 0.817
# Same same but different!
# SO: always remember that visual analysis is important!
Let’s do some data visualization using the actual SOTA Data product now.
The data frame is pretty big, it currently has 117 columns and 1921 rows.
We want to look at visualization of some data (I will take advantage of this and show you Crypto-related stuff)
SOTA <- read.csv('https://raw.githubusercontent.com/derele/Mouse_Eimeria_Field/master/data_products/SOTA_Data_Product.csv')
# we want Brandenburg HMHZ samples only
SOTA <- SOTA[SOTA$Mouse_ID %like% "AA_", ]
# we want to compare the catching rate overall to the mice that were Crypto-positive
All_Samples <- SOTA %>% group_by(Year) %>% count()
Pos_Samples <- SOTA %>% filter(ILWE_Crypto_Ct > 0, Year >= 2016) %>% group_by(Year) %>% count()
# let's combine both and calculate the prevalence
Samples_Yr <- full_join(Pos_Samples, All_Samples, by = "Year") %>%
mutate(Prevalence = (n.x / n.y)) %>%
filter(!is.na(Prevalence))
# let's visualize that with a bar plot (geom_col is like geom_bar, but can take both x AND y)
Samples_Yr %>%
ggplot(aes(x = Year, label = Prevalence)) +
geom_col(aes(y = n.y, fill = "blue")) +
geom_col(aes(y = n.x, fill = "red")) +
geom_text(aes(label = percent(Prevalence),
y = (n.x / n.y)), nudge_y = 9) +
labs(y = "Samples [n]") +
theme(legend.position = "none") +
ggtitle("Cryptosporidium spp. prevalence in the HMHZ since 2016")
To predict oocysts (or better oocyst equivalents, as we technically quantify oocyst DNA), we need a couple of things:
(This is predicted not on SOTA, but on my Crypto Data product and joined with SOTA later)
## Mouse_ID Year HI Oocyst_Predict_Crypto.1
## 278 AA_0322 2017 0.1500000 974580
## 479 AA_0523 2017 0.4210526 152390
## 98 AA_0144 2016 0.8695652 53344
## 281 AA_0325 2017 0.8000000 40962
## 648 AA_0689 2019 0.6875000 20370
## 163 AA_0209 2016 0.7391304 18174
## 60 AA_0106 2016 0.8947368 17247
## 477 AA_0521 2017 0.3913043 15872
## 236 AA_0282 2017 0.1666667 14534
## 293 AA_0337 2017 0.9130435 14338
Let’s look at mapping with leaflet now.
This is a fantastic tool to get a geographic overview of your data and include specific information using color codes and legends.
Leaflet maps are interactive and you can load the SOTA coordinates of our mice quite easily.
We need a couple of packages to make this work, but again - just an overview of possibilities for you!
To check out other maps from third-party providers, type ‘addProviderTiles(providers$)’ and push tab after the dollar sign.
You can add different markers, I used circle markers that allow popup information, but you may want to use other ones.
library(RColorBrewer) #display.brewer.all()
library(leaflet)
library(htmltools)
SOTA <- read.csv("https://raw.githubusercontent.com/derele/Mouse_Eimeria_Field/master/data_products/SOTA_Data_Product.csv") %>% filter(!is.na(Latitude), !is.na(Longitude), !is.na(HI), !Mouse_ID %in% c("SK_2697", "SK_3173", "SK_905"))
Crypto_Detection <- read.csv("https://raw.githubusercontent.com/derele/Mouse_Eimeria_Field/master/data_products/Crypto_Detection.csv")
Crypto_Detection_tested <- read.csv("https://raw.githubusercontent.com/derele/Mouse_Eimeria_Field/master/data_products/Crypto_Detection.csv") %>% filter(ILWE_Crypto_Ct >= 0, !is.na(Latitude), !is.na(Longitude))
Crypto_Positive <- read.csv("https://raw.githubusercontent.com/derele/Mouse_Eimeria_Field/master/data_products/Crypto_Detection.csv") %>% filter(ILWE_Crypto_Ct > 0, !is.na(Latitude), !is.na(Longitude))
Crypto_Detection_21 <- read.csv("https://raw.githubusercontent.com/derele/Mouse_Eimeria_Field/master/data_products/Crypto_Detection.csv") %>% filter(Year == 2021, ILWE_Crypto_Ct > 0, !is.na(Latitude), !is.na(Longitude))
High_Infection_Samples <- SOTA %>%
select(Mouse_ID, ILWE_Crypto_Ct, Oocyst_Predict_Crypto, Year, Latitude, Longitude, HI, Sex) %>%
filter(ILWE_Crypto_Ct > 0,
Year >= 2016) %>%
arrange(ILWE_Crypto_Ct) %>%
head(20)
#- EIM_INFECTED ... This location had samples with Eimeria Infection, radius == Infection intensity (delta_ct_cewe_MminusE)
SOTA <- SOTA %>%
mutate(Eim_Species = ifelse(eimeriaSpecies == "E_falciformis", "E_falciformis",
ifelse(eimeriaSpecies == "E_ferrisi", "E_ferrisi",
ifelse(eimeriaSpecies == "Eimeria_alorani", "Eimeria_alorani",
ifelse(eimeriaSpecies == "Eimeria_apionodes", "Eimeria_apionodes",
ifelse(eimeriaSpecies == "Eimeria_falciformis", "Eimeria_falciformis",
ifelse(eimeriaSpecies == "Eimeria_sp_Apodemus", "Eimeria_sp_Apodemus",
ifelse(eimeriaSpecies == "Eimeria_vermiformis", "Eimeria_vermiformis",
ifelse(eimeriaSpecies == "Negative", "Negative",
ifelse(NA))))))))),
Eimeria_Positive = case_when(Eim_Species != "Negative" | Ct.Eimeria > 0 | OPG > 0 ~ T,
Eim_Species == "Negative" | Ct.Eimeria == 0 | OPG == 0 ~ F))
# We want to look at Eim_infected samples in our map, so we extract a dataframe with samples that fulfill this condition
EIM_INFECTED <- SOTA %>% filter(Eimeria_Positive == T)
# Cutting the data into slices helps to make nice legends!
SOTA$HI <- as.numeric(SOTA$HI)
SOTA$HI_Level <- cut(SOTA$HI, c(0, 0.001, 0.250, 0.500, 0.750, 0.999, 1), include.lowest = T ,
labels = c('HI = 0', 'HI < 0.25', 'HI < 0.5', 'HI < 0.75', 'HI < 1', 'HI = 1'))
# We also want to look at samples that sent into sequencing (was interesting for my Bachelor thesis)
Sequenced <- Crypto_Detection %>%
mutate(seq = Mouse_ID %in% c("AA_0144", "AA_0325", "AA_0689", "AA_0209", "AA_0282", "AA_0793", "AA_0667", "AA_0805", "AA_0900",
"AA_0523", "AA_0534", "AA_0537", "AA_0545", "AA_0553", "AA_0554", "AA_0578", "AA_0580", "AA_0585",
"AA_0546", "AA_0589", "AA_0571", "AA_0667")) %>%
filter(seq == T) # only retain samples that fulfill this condition
data_col_HI = colorFactor(myPal(6), SOTA$HI) # for our samples, uncut
data_col_HI_Level = colorFactor(myPal(6), SOTA$HI_Level) # for our legend palette, cut
# We must provide a map to which we anchor the data.
# Check out other third party tile providers using: addProviderTiles(providers$)
# CartoDB worked best for the colors I used
map <- Crypto_Detection_21 %>%
leaflet() %>%
addProviderTiles("CartoDB") %>%
setView(lat = 52.520007, lng =13.404954, zoom = 7)
map %>%
addPolylines(lat = c(55.0000, 53.6000, 53.51885, 52.8875 , 52.6053, 51.8978, 45.0000),
lng = c(10.0000, 11.4563, 12.4464,13.8119 , 13.8756, 13.8103, 13.5000),
color = "purple",
weight = 55,
opacity = 0.1) %>%
addCircleMarkers(data = EIM_INFECTED,
col = "#72cac3",
label = ~htmlEscape(Mouse_ID),
popup = ~paste("<b>Mouse_ID:<b>",as.character(Mouse_ID), "<br>",
"<b>Location:<b>", as.character(Latitude), "<b>,<b>", as.character(Longitude), "<br>",
"<b>HI:<b>", as.character(round(HI, digits = 2)), "<br>",
"<b>Year:<b>", as.character(Year),"<br>",
"<b>Ct Mean:<b>", as.character(round(ILWE_Crypto_Ct, digits = 2)),"<br>",
"<b>Oocyst Prediction:<b>", as.character(Oocyst_Predict_Crypto), "<br>",
"<b>Sex:<b>", Sex, "<br>",
sep=" "),
opacity = 0.3,
radius = sqrt((EIM_INFECTED$delta_ct_cewe_MminusE)^2),
group = "Eim_Infected") %>%
addCircleMarkers(data = SOTA,
col = ~data_col_HI(HI),
label = ~htmlEscape(Mouse_ID),
popup = ~paste("<b>Mouse_ID:<b>",as.character(Mouse_ID), "<br>",
"<b>Location:<b>", as.character(Latitude), "<b>,<b>", as.character(Longitude), "<br>",
"<b>HI:<b>", as.character(round(HI, digits = 2)), "<br>",
"<b>Year:<b>", as.character(Year),"<br>",
"<b>Ct Mean:<b>", as.character(round(ILWE_Crypto_Ct, digits = 2)),"<br>",
"<b>Oocyst Prediction:<b>", as.character(Oocyst_Predict_Crypto), "<br>",
"<b>Sex:<b>", Sex, "<br>",
sep=" "),
opacity = 1,
radius = 3,
group = "Samples (total)") %>%
addCircleMarkers(data = Crypto_Detection_tested,
col = ~data_col_HI(HI),
label = ~htmlEscape(Mouse_ID),
popup = ~paste("<b>Mouse_ID:<b>",as.character(Mouse_ID), "<br>",
"<b>Location:<b>", as.character(Latitude), "<b>,<b>", as.character(Longitude), "<br>",
"<b>HI:<b>", as.character(round(HI, digits = 2)), "<br>",
"<b>Year:<b>", as.character(Year),"<br>",
"<b>Ct Mean:<b>", as.character(round(ILWE_Crypto_Ct, digits = 2)),"<br>",
"<b>Oocyst Prediction:<b>", as.character(Oocyst_Predict_Crypto), "<br>",
"<b>Sex:<b>", Sex, "<br>",
sep=" "),
opacity = 1,
radius = 3,
group = "Samples (Crypto-tested)") %>%
addCircleMarkers(data = Crypto_Positive,
col = ~data_col_HI(HI),
label = ~htmlEscape(Mouse_ID),
popup = ~paste("<b>Mouse_ID:<b>",as.character(Mouse_ID), "<br>",
"<b>Location:<b>", as.character(Latitude), "<b>,<b>", as.character(Longitude), "<br>",
"<b>HI:<b>", as.character(round(HI, digits = 2)), "<br>",
"<b>Year:<b>", as.character(Year),"<br>",
"<b>Ct Mean:<b>", as.character(round(ILWE_Crypto_Ct, digits = 2)),"<br>",
"<b>Oocyst Prediction:<b>", as.character(Oocyst_Predict_Crypto), "<br>",
"<b>Sex:<b>", Sex, "<br>",
sep=" "),
opacity = 1,
radius = 3,
group = "Samples (Crypto-positive)") %>%
addCircleMarkers(data = Sequenced,
col = ~data_col_HI(HI),
label = ~htmlEscape(Mouse_ID),
popup = ~paste("<b>Mouse_ID:<b>",as.character(Mouse_ID), "<br>",
"<b>Location:<b>", as.character(Latitude), "<b>,<b>", as.character(Longitude), "<br>",
"<b>HI:<b>", as.character(round(HI, digits = 2)), "<br>",
"<b>Year:<b>", as.character(Year),"<br>",
"<b>Ct Mean:<b>", as.character(round(ILWE_Crypto_Ct, digits = 2)),"<br>",
"<b>Oocyst Prediction:<b>", as.character(Oocyst_Predict_Crypto), "<br>",
"<b>Sex:<b>", Sex, "<br>",
sep=" "),
opacity = 5,
radius = 3,
group = "Sequenced") %>%
addCircleMarkers(data = High_Infection_Samples,
color = ~data_col_HI(HI),
label = ~htmlEscape(Mouse_ID),
popup = ~paste("<b>Mouse_ID:<b>",as.character(Mouse_ID), "<br>",
"<b>Location:<b>", as.character(Latitude), "<b>,<b>", as.character(Longitude), "<br>",
"<b>HI:<b>", as.character(round(HI, digits = 2)), "<br>",
"<b>Year:<b>", as.character(Year),"<br>",
"<b>Ct Mean:<b>", as.character(round(ILWE_Crypto_Ct, digits = 2)),"<br>",
"<b>Oocyst Prediction:<b>", as.character(Oocyst_Predict_Crypto), "<br>",
"<b>Sex:<b>", Sex, "<br>",
sep=" "),
opacity = 3,
radius = 3,
group = "High_Infection_Samples") %>%
addLegend("bottomleft",
pal = data_col_HI_Level,
title = "HI",
values = SOTA$HI_Level,
group = c('= 0', '< 0.25', '< 0.5', '< 0.75', '< 1', '= 1'),
opacity = 1) %>%
addLayersControl(baseGroups = c("Samples (total)", "Samples (Crypto-tested)", "Samples (Crypto-positive)", "High_Infection_Samples", "Sequenced"),
overlayGroups = c("Eim_Infected"),
options = layersControlOptions(collapsed = F))
## Assuming "Longitude" and "Latitude" are longitude and latitude, respectively
## Assuming "Longitude" and "Latitude" are longitude and latitude, respectively
## Assuming "Longitude" and "Latitude" are longitude and latitude, respectively
## Assuming "Longitude" and "Latitude" are longitude and latitude, respectively
## Assuming "Longitude" and "Latitude" are longitude and latitude, respectively
## Assuming "Longitude" and "Latitude" are longitude and latitude, respectively
Let’s say we want to:
There are different functions for saving different types - as usual, I would recommend saving data frames as csv files
For saving pictures, there are probably even more possibilities:
My personal favorite for high resolution pictures:
Load library(cowplot), assign your plot to an object, e.g. my_Plot <- …
follow the saving path:
save_plot(my_Plot, filename = ‘my_Plot.jpg’, base_height = 5, base_width = 7.5)
change the base_height and _width according to your needs (plots with facet_wrap() tend to get quite unreadable if exported too small), I would go up to base_height 20 and base_weight 30 if you want really high resolution
beware, that the font size WILL SHRINK dramatically with increasing base_height/weight conditions!
export the picture through the viewer window (export let’s you either export as Image, PDF or copy to clipboard)
save the picture with the jpeg() function (or similar), which allows you to set the quality of the image as a percentage (I think this works best/better in normal R scripts, not in Markdown files :/ )
# JPEG device jpeg(‘Crypto_SC.jpeg’, quality = 75) plot(Crypto_SC) dev.off()
library(cowplot)
write.csv(High_Infection_Samples, 'High_Infection_Samples.csv') # don't forget to add the type of file, in this case '.csv' at the end (otherwise it will be a binary, non-readable file for R)
Crypto_SC <- Standard_Curve %>%
ggplot(aes(log2(Amount_Oocysts), ILWE_Crypto_Ct)) +
geom_smooth(method = 'lm') +
geom_point() +
xlab('Amount Oocysts (log2)') +
ylab('Ct') +
ggtitle('Cryptosporidium Standard Curve')
Crypto_SC
## `geom_smooth()` using formula 'y ~ x'
save_plot(Crypto_SC, filename = 'Crypto_SC.jpg', base_height = 5, base_width = 7.5)
## `geom_smooth()` using formula 'y ~ x'
To display a picture in markdown, you should follow this pattern:
Caption